ABSTRACT 

The availability of large scale multitasked parallel architectures introduces the following 
processor assignment problem for pipelined computations. Given a set of tasks and their 
precedence constraints, along with their experimentally determined individual response times 
for different processor sizes, find an assignment of processors to tasks. Two objectives interest 
us: minimal response given a throughput requirement, and maximal throughput given a 
response time requirement. These assignment problems differ considerably from the classical 
mapping problem in which several tasks share a processor; instead, we assume that a large 
number of processors are to be assigned to a relatively small number of tasks. In this paper 
we develop efficient assignment algorithms for different classes of task structures. For a p 
processor system and a series-parallel precedence graph with n constituent tasks, we provide 
an $O(np^2)$ algorithm that finds the optimal assignment for the response time optimization 
problem; we find the assignment optimizing the constrained throughput in $O(np^2 \log p)$ time. 
Special cases of linear, independent, and tree graphs are also considered. We 
also examine more efficient algorithms when certain restrictions are placed on the problem 
parameters. Our techniques are applied to a task system in computer vision. 


*Research was supported by the National Aeronautics and Space Administration under NASA Contract 
No. NAS1-18605 while the author was in residence at the Institute for Computer Applications in Science 
and Engineering, NASA Langley Research Center, Hampton, VA 23665-5225. 

1 Introduction 


In recent years much research has been devoted to the problem of mapping large computations onto 
a system of parallel processors. Various aspects of the general problem have been studied, including 
different parallel architectures, task structures, communication issues and load balancing [11, 16]. 
Typically, experimentally observed performance (e.g., speedup or response time) is tabulated as a 
function of the number of processors employed. We are particularly interested in tabulations of 
response time, which we will refer to as response-time functions. Our work is also motivated by the 
growing availability of multitasked parallel architectures, such as PASM [37], the NCube system 
[18], and Intel’s iPSC system [7], in which it is possible to map tasks to processors and allow parallel 
execution of multiple tasks in different logical partitions. 

In this paper, we consider the problem of optimizing performance of a task structure on a 
parallel architecture, given a large supply of processors, and the experimentally determined response 
time functions for its constituent tasks. The task structure describes the sequencing of various 
computational activities (tasks) that are to be applied to each of many data sets; the data sets 
themselves are pipelined through the task structure. We refer to this class of computations as 
pipeline computations. This problem arises in data parallel applications such as the computer 
vision example we consider in this paper, when individual tasks, e.g. a fast Fourier transform, 
are highly parallelizable. Unlike prior treatments of the mapping problem we are interested in 
the case where there are many more processors than tasks. Rather than ask which tasks must 
share a processor, we ask how many processors each task should be allocated. We are interested 
in both the response time of the task structure on one data set, and in the throughput (data sets 
processed per unit time). We consider the dual problems of minimizing response time subject to a 
throughput constraint, and maximizing throughput subject to a response time constraint. These 
problems are complementary, in the sense that allocation to increase throughput may have the side 
effect of increasing response time, and vice versa. 

Under the assumption that the constituent task response time functions completely characterize 
performance, we show that p processors can be optimally allocated to an n-node series-parallel task 
structure in $O(np^2)$ time. We study separately the special cases of linear and tree structures and 
show an $O(np^2)$ procedure; we also consider response time function characteristics such as convexity 
which are exploited to achieve even more efficient algorithms. Our methods are applied to the task 
of motion estimation in a computer vision system; we present several experimental results for both 
the response time as well as the throughput problem. 

The problem of mapping workload to processors has attracted a great deal of attention in 
the literature, leading to a number of problem formulations. One often views the computation in 
terms of a graph, where nodes represent computations and edges represent communication; for an 
example, see [2]. In this case, mapping means assigning each node (task) to a processor. One view 
of the mapping problem is that the computation graph represents a distributed program, with a 
serial thread of control. Tasks have different affinities for different heterogeneous processors; the 
problem is to assign tasks to processors so that the total sum of execution times (of all tasks) 
and communication costs is minimized. Fundamental contributions to this problem are made in 
[4, 39, 41]. However, the objective function for this problem does not capture any parallelism among 
the tasks. Another mapping problem formulation views the architecture as a graph whose nodes 
are processors and whose edges identify processors able to communicate directly. The dilation 
of a computation graph edge (u, v) is the minimum distance (in the processor graph) between 
the processors to which u and v are respectively assigned. The dilation of the graph itself is the 
maximum dilation among all computation graph edges. Dilation is a measure of how well the 
mapping preserves locality between nodes in the mapped computation graph. Results concerning 
the minimization of dilation can be found in [8, 19, 32, 36], and their references. Yet another 
formulation directly models execution time of a data parallel computation as a function of the 
chosen mapping, and attempts to find a mapping that minimizes the execution time. Workload 
may again be represented as a graph, with edges representing data communication. Nodes are 
mapped to processors in such a way that each processor’s workload is approximately the same; for 
example, see [1, 5, 24, 33, 35]. Formulations using simulated annealing or neural networks attempt 
to minimize an “energy” function that heuristically quantifies the cost of the partition [6, 1 ' ]. 
Other interesting formulations consider mapping highly structured computations onto pipelined 
multiprocessors [25], and mapping systolic algorithms onto hypercubes [22]. The problem we study 
is distinctly different from these, in that it seeks the assignment of multiple processors to a task, 
rather than multiple tasks to a processor. 

Recently, some studies consider the scheduling of tasks on multitasked parallel architectures 
where each task can be assigned a set of processors. The objective in such work, for example 
in [3, 13, 27], is to find a schedule that minimizes completion time. A fundamental difference 
between the processor assignment problem studied in this paper and the above scheduling problems 
is that scheduling formulations allow tasks to be queued or sequenced. In contrast, the nature 
of pipeline computations recommends assigning at least one processor to each task: executable 
images which would be swapped into main memory for each data set under scheduling, would 
remain in main memory under our assignment formulation. The problem of assigning processors 
to a set of independent tasks where each task is a chain of modules is considered in [10]. This 
differs from our problem, as neither response-time functions nor task precedence is treated. In 
other formulations, each task requires a specific number of processors; in this case, the problem of 
scheduling tasks on a partitionable hypercube or mesh-connected architectures has been studied 
[9, 14, 23, 29]. Pipeline computations are studied in [25, 38]. In [38], heuristics are given for 
scheduling planar acyclic task structures and in [25], a methodology is presented for analyzing 
pipeline computations using Petri nets together with techniques for partitioning computations. We 
have not discovered treatments that address optimal processor assignment to pipeline computations, 
although our solution approach (dynamic programming) is related to those in [4] and [41]. 

This paper is organized as follows. Section §2 introduces notation, and formalizes the response- 
time problem and the throughput problem. Section §3 develops some preliminary results about 
response time functions that will be used throughout the paper. Section §4 closely examines two 
response-time problems associated with linear arrays of tasks, and Section §5 applies these results to 
tasks structured as trees or more general series-parallel graphs. Section §6 shows how the problem 
of maximizing throughput subject to a response-time constraint can be solved using solutions to 
the response-time problem. Section §7 discusses application of our techniques to actual problems, 
and Section §8 summarizes this work. 

Table 1: Example of response time functions 

2 Problem Definition 

A pipeline computation is a quadruple $V = \langle K, T, F, G \rangle$ where 

• $K = \{1, \dots, p\}$ is a set of identical processors. 

• $T = \{t_1, \dots, t_{n+1}\}$ is a set of tasks labeled such that $t_1$ is always the first task and $t_{n+1}$ the last 
task executed on each data set. We will assume that the last task $t_{n+1}$ is a “dummy” task 
that requires no processing; it is used for convenience of notation in the graph $G$, described 
below. 

• $F = \{f_1, \dots, f_{n+1}\}$ is a collection of response-time functions $f_i : K \to \mathbb{R}^+$, one for each task. For 
notational convenience we assume that $f_i(0) = \infty$ for all $i = 1, \dots, n$. We also assume that 
$f_{n+1}(x) = 0$ for all $x$, so that no processors need ever be assigned to the dummy task. It 
is often convenient to think of the discrete function $f_i$ as a table, a format we shall use in 
this paper. Later, we will also use $F$ to denote the response time functions for a whole task 
structure. 

• $G = (T, E)$ is a directed acyclic graph (DAG) describing the precedence relation for the tasks 
in $T$. Thus, $(t_i, t_j) \in E$ if $t_i$ immediately precedes $t_j$. 

An example of a response time table for $n = 5$ and $p = 8$ is shown in Table 1. Each row of the 
entire table is a response time function for a particular task. In the course of the paper we will be 
constructing examples to demonstrate the use of our algorithms for various graph structures; these 
examples will use the response time functions in this table. 

Our definition of a pipeline computation extends earlier definitions [25, 38] to include the 
empirically determined response-time functions. Observe that $f_i(k)$ may include the communication 
costs inherent in executing $t_i$ on $k$ processors, as well as the communication costs $t_i$ may suffer 
communicating with predecessor and/or successor tasks in $T$. This paper assumes that all performance 
dependencies on communication are captured in the response time functions. Our problem 
formulation therefore does not attempt to deal with any issues related to matching the task 
structure topology to the architecture topology. It implicitly assumes that performance is 
independent of which processors are assigned to a task. These assumptions are reasonable when the cost 
of communication is largely independent of the distance between communicating processors (as is 
the case with the Intel iPSC/2 [7]), and the communication bandwidth is sufficiently high for us to 
ignore effects due to contention between pairs of communicating tasks. They are also reasonable 
for compute-bound applications, for which load-balancing of the type we study is a major concern. 
The computer vision application we later consider is compute-bound. 

Let $A : T \to \mathbb{Z}$ denote a feasible assignment of processors to tasks such that $\sum_{i=1}^{n} A(t_i) \le p$ and 
$A(t_i) \ge 1$ for all $t_i$ where $1 \le i \le n$. Observe that we do not require all $p$ processors to be assigned, 
as it is possible that increasing the number of processors used actually hampers performance. In 
addition, observe that each task must be assigned at least one processor; this condition clearly 

differentiates between an assignment and a schedule. 

For a pipeline computation V and assignment (mapping) A, define the following: 

• $S(V, A) = \max_{1 \le i \le n} f_i(A(t_i))$, the largest response time, under $A$, among all tasks. 

• $\Lambda(V, A) = S(V, A)^{-1}$. We will later argue that this quantity is the maximal throughput 
under assignment $A$, i.e., the maximum rate at which successive data sets can be processed 
by the task system. 

• $L = \{ l \mid l$ is a path in $G$ starting from $t_1$ and ending in $t_{n+1} \}$. $L$ is thus the set of all complete 
paths through $G$. We will write each $l \in L$ as a set $\{i_1, \dots, i_k\}$, $i_1 = 1$, $i_k = n + 1$, $1 \le k \le n + 1$, 
with $l$ consisting of the edges $(t_{i_1}, t_{i_2}), \dots, (t_{i_{k-1}}, t_{i_k})$. 

• $R(V, A) = \max_{l \in L} \sum_{i \in l} f_i(A(t_i))$, the “length” of the longest path through $G$. $R(V, A)$ is 
thus the total time required to execute one data set, i.e., the response time. 
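
The three quantities just defined can be computed directly from the response-time tables. The following is a minimal sketch, assuming the tables are stored as dictionaries and the precedence graph as a successor map; the function names, the task labels, and the numeric values are hypothetical and are not taken from Table 1.

```python
from functools import lru_cache

def bottleneck(f, assign):
    """S(V, A): the largest per-task response time under the assignment."""
    return max(f[t][assign[t]] for t in assign)

def throughput(f, assign):
    """Lambda(V, A) = 1 / S(V, A)."""
    return 1.0 / bottleneck(f, assign)

def response_time(f, succ, assign, first, last):
    """R(V, A): length of the longest path from `first` to the dummy task `last`."""
    @lru_cache(maxsize=None)
    def longest(t):
        if t == last:                 # the dummy task requires no processing
            return 0.0
        return f[t][assign[t]] + max(longest(s) for s in succ[t])
    return longest(first)

# Hypothetical example: t1 precedes t2 and t3, both of which precede the dummy t4.
f = {
    "t1": {1: 8.0, 2: 5.0},
    "t2": {1: 6.0, 2: 4.0},
    "t3": {1: 3.0, 2: 2.0},
}
succ = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"]}
assign = {"t1": 2, "t2": 1, "t3": 1}

S = bottleneck(f, assign)                           # max(5, 6, 3)
R = response_time(f, succ, assign, "t1", "t4")      # 5 + max(6, 3)
```

Here the bottleneck task is t2, so the throughput is 1/6, while the response time follows the longer path through t2.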

With these definitions we formulate two problems. 

Response time problem: 

Given a pipeline computation $V$ and throughput requirement $\lambda$, find an assignment 
$A^*$ such that $\Lambda(V, A^*) \ge \lambda$, and $R(V, A^*) \le R(V, A)$ for every feasible assignment 
$A$ which satisfies $\Lambda(V, A) \ge \lambda$. 

We are also interested in determining how the optimal response time $R(V, A^*)$ behaves as a 
function of $p$, the maximum number of available processors. In other words, we are interested 
in obtaining the response time function for the entire computation $V$: the values of $R(V, A^*)$ for 
different values of $p$. We will call this $V$'s optimal response time function, or sometimes simply the 
response time function (the optimality being understood). 

Throughput problem: 

Given a pipeline computation $V$ and response time requirement $\rho$, find an assignment 
$A^*$ such that $R(V, A^*) \le \rho$, and $\Lambda(V, A^*) \ge \Lambda(V, A)$ for every feasible assignment 
$A$ which satisfies $R(V, A) \le \rho$. 

The response time problem arises when we have a steady stream of input data arriving at a fixed 
rate and the system must complete processing each data set as soon as possible. The throughput 
problem arises when there is flexibility in the amount of time it takes to process one data set 
but the throughput must be maximized to handle high input data rates. Both conditions appear 
in real-time applications. Our approach will be to focus first on the response-time problem, for 
different task structures; in Section §6 we then show how solutions to the response time problem 
can be used to solve the throughput problem. 
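
For very small instances, both problems can be solved by exhaustive enumeration, which is useful as a reference when checking faster algorithms. The sketch below assumes a linear chain (so the response time is the sum of the stage times) and uses hypothetical tables; `solve_response_time`, `solve_throughput`, and `tables` are illustrative names, and the running time is exponential in the number of tasks.

```python
from itertools import product
import math

def feasible_assignments(n, p):
    """All assignments giving each of n tasks at least one processor, at most p in total."""
    for a in product(range(1, p + 1), repeat=n):
        if sum(a) <= p:
            yield a

def solve_response_time(f, p, lam):
    """Minimize the response time subject to throughput >= lam (linear chain)."""
    best = (math.inf, None)
    for a in feasible_assignments(len(f), p):
        times = [f[i][k] for i, k in enumerate(a)]
        if max(times) * lam <= 1 and sum(times) < best[0]:
            best = (sum(times), a)
    return best

def solve_throughput(f, p, rho):
    """Maximize throughput 1 / max stage time subject to response time <= rho."""
    best = (0.0, None)
    for a in feasible_assignments(len(f), p):
        times = [f[i][k] for i, k in enumerate(a)]
        if sum(times) <= rho and 1.0 / max(times) > best[0]:
            best = (1.0 / max(times), a)
    return best

# Hypothetical tables: tables[i][k] is the response time of task i on k processors.
tables = [
    [math.inf, 50, 30, 22, 18, 16, 15],
    [math.inf, 45, 38, 25, 20, 18, 17],
    [math.inf, 35, 20, 14, 11, 10, 9],
]
best_response = solve_response_time(tables, 6, 1 / 40)
best_throughput = solve_throughput(tables, 6, 95)
```

With these values the two objectives pull in different directions: the response-time optimum is the assignment (2, 2, 2), whereas the throughput optimum under a response-time bound of 95 is (2, 3, 1).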

3 Preliminaries 

Much of this paper is devoted to the issue of decomposing a large task structure into a set of smaller 
task structures and constructing a response time function for the large structure from response time 
functions for the smaller structures. This is accomplished by first separately studying algorithms 
for handling simple task structures such as tasks in series and tasks in parallel. Then more complex 
task structures such as trees and series-parallel graphs are treated by decomposing the optimization 
procedure to handle series and parallel components of the overall task structure. 

Given $x$ ($x \le p$) processors and a task structure consisting only of two tasks $t_1, t_2$, with response 
time functions $f_1, f_2$, we wish to determine $y$ such that assigning $y$ processors to $t_1$ and $x - y$ to 
$t_2$ satisfies the throughput requirement and minimizes the overall response time. If we tabulate 
this minimal response time for each value of $x$, then we obtain a response time function for the 
aggregate of $t_1$ and $t_2$. Note that this function captures optimality and is thus an optimal response 
time function. In general, given a set of task structures $\{V_1, \dots, V_m\}$, where for $j = 1, \dots, m$, 
$V_j = \langle K, T_j, F_j, G_j \rangle$, we extend the notion of response time function for a single task to a response time 
function for an entire pipeline computation; let $F_j : \mathbb{Z} \to \mathbb{R}$ be the response time function for $V_j$, 
i.e., $F_j(x)$ is the optimal response time achieved for $V_j$ using $x$ processors. Suppose also that we have 
an $m$-node graph $\mathcal{G}$ that describes a precedence relation on $\{V_1, \dots, V_m\}$. We may view each $V_j$ as 
an arbitrary task, even though $V_j$ may itself have a complex subtask structure. We wish to construct 
the optimal response time function for the structure $Q = \langle K, \{V_1, \dots, V_m\}, \mathcal{G} \rangle$, given a 
throughput constraint $\lambda$. We accomplish this by solving a number of response-time problems: for 
every $x \in [1, p]$ processors, we determine the minimal response time $h(x)$ achievable by allocating no 
more than $x$ processors among the task structures $V_j$ in such a way that the throughput requirement 
is satisfied. $h(x)$ becomes the optimal response time function for $Q$, which now can be treated as 
a task itself with a known response-time function. 

We are interested in properties of optimal response time functions that are conserved through 
such an aggregation procedure. Two questions are particularly important: (i) what is the minimum 
number of processors needed for $Q$ to meet the throughput constraint, and (ii) what is the maximum 
number of processors that $Q$ should be allocated? The answer to the first question is straightforward, 
whereas the answer to the second requires additional analysis. 

First consider the throughput constraint question. Let $u_\lambda(V_j)$ denote the minimum number of 
processors $V_j$ must be allocated in order to meet throughput constraint $\lambda$. For a single task $t_i$, 
$u_\lambda(t_i)$ denotes the minimum that must be assigned to task $t_i$, i.e., $u_\lambda(t_i) = \min_{k \in K} \{ k : f_i(k) \le 1/\lambda \}$. 

Observe that any distribution of processors within $Q$ must assign at least $u_\lambda(V_j)$ processors to $V_j$ if $Q$ is to 
meet the throughput requirement. As this is true for each $V_j$, it is clear that 

$$u_\lambda(Q) \ge \sum_{j=1}^{m} u_\lambda(V_j). \qquad (1)$$

This is true regardless of the structure of $Q$. It is also true that if every $V_j$ is allocated $u_\lambda(V_j)$ 
processors, then $Q$'s throughput is at least $\lambda$. One need only perform an easy induction on the 
number of nodes in the precedence graph to establish that $Q$'s throughput is the inverse of the 
maximal response-time among all tasks in $Q$. This shows that the inequality in equation (1) can be 
reversed, thereby implying equality. Thus, the rule for computing minimal processor requirements 
for $Q$ is simple, and general: add the minimal requirements of $Q$'s constituent tasks. 
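
The rule above can be sketched in a few lines. The code assumes each response-time table is a list indexed by processor count with entry 0 set to infinity; `min_processors`, `min_processors_aggregate`, and the sample values are hypothetical.

```python
import math

def min_processors(f, lam):
    """u_lambda(t): smallest k with f(k) <= 1/lam, or None if no k qualifies."""
    for k in range(1, len(f)):
        if f[k] <= 1.0 / lam:
            return k
    return None

def min_processors_aggregate(tables, lam):
    """u_lambda(Q): the sum of the constituents' requirements (equation (1) with equality)."""
    needs = [min_processors(f, lam) for f in tables]
    if any(k is None for k in needs):
        return None                   # some task can never meet the constraint
    return sum(needs)

# Hypothetical tables: tables[i][k] is the response time of task i on k processors.
tables = [
    [math.inf, 50, 30, 22],
    [math.inf, 45, 38, 25],
    [math.inf, 35, 20, 14],
]
need = min_processors_aggregate(tables, 1 / 40)   # 2 + 2 + 1
```

With a throughput requirement of 1/40, each task must respond within 40 time units, so the first two tasks need two processors each and the third needs one.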

To answer the second question, especially when $Q$ is complex, we need to manipulate the 
functions so that certain conditions are satisfied. For a response time function $f(x)$, define the 
reduced response time function $\underline{f}(x)$ as: 

$$\underline{f}(x) = \min_{0 < y \le x} \{ f(y) \}$$

Note that $\underline{f}$ is monotonically decreasing (non-increasing), whereas $f$ need not be, and can be 
defined both for single tasks as well as for whole computations by using the appropriate response 
time function. In several applications, increasing communication costs when a large number of 
processors is used can force response times to increase with increasing $x$. In general, we would 
like to treat response time functions that behave arbitrarily (exhibit several local minima) with 
increasing $x$. The adjustment above will prevent assigning “too many” processors. A processor 
assignment $x$ is called reducible if $\exists y < x : f(y) \le f(x)$. It is otherwise irreducible. For obvious 
reasons, we seek irreducible assignments. In the example in Table 1 the response time for task $t_3$, 
i.e., $f_3(x)$, can be reduced while all other functions cannot. After the adjustment, we have the 
reduced response time function with $\underline{f}_3(8) = 1.5$, which assigns only 7 processors to task $t_3$. 
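
Reduction is a running minimum over the table. The sketch below also records how many processors are actually used at each entry, so that an irreducible assignment can be read off. The table `f3` is hypothetical: its values are chosen only to mimic the behaviour described for task $t_3$ (minimum 1.5 at 7 processors) and are not the entries of Table 1.

```python
import math

def reduce_table(f):
    """Return (reduced, used): reduced[x] = min of f(y) over 0 < y <= x,
    and used[x] = the smallest y attaining that minimum."""
    reduced, used = [], []
    best, best_y = math.inf, 0
    for y, value in enumerate(f):
        if value < best:              # strict test: ties keep the smaller processor count
            best, best_y = value, y
        reduced.append(best)
        used.append(best_y)
    return reduced, used

# Hypothetical table for p = 8; index 0 is the f(0) = infinity convention.
f3 = [math.inf, 9.0, 5.0, 3.0, 2.4, 2.0, 1.7, 1.5, 1.6]
reduced, used = reduce_table(f3)
```

Here the eighth processor makes the raw response time worse, so the reduced table repeats the value obtained with seven processors and records that only seven are used.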

We next derive some properties of reduced response time functions that we will later use in our 
algorithms. Consider first a simple case of two elemental tasks $t_1$ and $t_2$ and their aggregate, $s$. 
Suppose $f_1(x)$ and $f_2(x)$ are the response time functions for $t_1$ and $t_2$, with reduced forms 
$\underline{f}_1$ and $\underline{f}_2$, and $H(x_1, x_2)$ is a real-valued function increasing in both arguments. Define 

$$f_s(x) = \min_{0 < y < x} \{ H(f_1(y), f_2(x - y)) \}. \qquad (2)$$

Here $f_s$ is the optimal response time function of the aggregate task $s$, written as some function of 
the response time functions of $t_1$ and $t_2$. In this paper, $H$ is usually a sum (for series tasks) or a 
maximum (for parallel tasks). Define 

$$\hat{f}_s(x) = \min_{0 < y < x} \{ H(\underline{f}_1(y), \underline{f}_2(x - y)) \}. \qquad (3)$$

We next show that: 

Lemma 3.1 For all $x = 1, \dots, p$, $\underline{f}_s(x) = \hat{f}_s(x)$, where $\underline{f}_s$ is the reduction of $f_s$. 

Proof: We first show that $\hat{f}_s(x)$ is monotone decreasing in $x$, and therefore $\hat{f}_s(x)$ is already 
irreducible. Since $\underline{f}_1$ and $\underline{f}_2$ are monotone decreasing and $H$ is increasing, for any $y$ 

$$H(\underline{f}_1(y), \underline{f}_2(x - y)) \ge H(\underline{f}_1(y), \underline{f}_2(x + 1 - y)).$$

Therefore, 

$$\min_{0 < y < x} \{ H(\underline{f}_1(y), \underline{f}_2(x - y)) \} \ge \min_{0 < y < x + 1} \{ H(\underline{f}_1(y), \underline{f}_2(x + 1 - y)) \},$$

that is, $\hat{f}_s(x)$ is decreasing. 

Next, for any $x > y > 0$, $\underline{f}_1(y) \le f_1(y)$ and $\underline{f}_2(x - y) \le f_2(x - y)$. Thus 

$$H(\underline{f}_1(y), \underline{f}_2(x - y)) \le H(f_1(y), f_2(x - y))$$

and hence 

$$\hat{f}_s(x) = \min_{0 < y < x} \{ H(\underline{f}_1(y), \underline{f}_2(x - y)) \} \le \min_{0 < y < x} \{ H(f_1(y), f_2(x - y)) \} = f_s(x).$$

As this is true for all $x = 1, \dots, p$, it follows that 

$$\min_{0 < y \le x} \hat{f}_s(y) \le \min_{0 < y \le x} f_s(y) \quad \text{for all } x.$$

But the left-hand side of the above is simply $\hat{f}_s(x)$ (because $\hat{f}_s$ is monotone decreasing); the 
right-hand side is $\underline{f}_s(x)$ (by definition), showing that $\hat{f}_s(x) \le \underline{f}_s(x)$ for all $x = 1, \dots, p$. 

Finally, we show $\hat{f}_s(x) \ge \underline{f}_s(x)$. For the sake of contradiction suppose $\exists x_0 : \underline{f}_s(x_0) > \hat{f}_s(x_0)$. 
Then 

$$\min_{0 < y \le x_0} f_s(y) > \hat{f}_s(x_0)$$

and thus, 

$$\forall y \le x_0 : \min_{0 < w < y} \{ H(f_1(w), f_2(y - w)) \} > \min_{0 < z < x_0} \{ H(\underline{f}_1(z), \underline{f}_2(x_0 - z)) \}. \qquad (4)$$

Next let the minimum of the right side of inequality (4) be achieved at $z = z_0$ with value 
$H(\underline{f}_1(z_0), \underline{f}_2(x_0 - z_0))$, 
with $\underline{f}_1(z_0) = f_1(a)$ and $\underline{f}_2(x_0 - z_0) = f_2(b)$ for some $a \le z_0$, $b \le x_0 - z_0$ and $a + b \le x_0$. Note 
that $a$ and $b$ are obtained through the reduction of $f_1$ and $f_2$. We may also rewrite inequality 
(4) as 

$$\forall y \le x_0 : \min_{0 < w < y} \{ H(f_1(w), f_2(y - w)) \} > H(\underline{f}_1(z_0), \underline{f}_2(x_0 - z_0)). \qquad (5)$$

But, with $y = a + b \le x_0$ above, we get 

$$\min_{0 < w < y} \{ H(f_1(w), f_2(y - w)) \} \le H(f_1(a), f_2(b)) = H(\underline{f}_1(z_0), \underline{f}_2(x_0 - z_0))$$

which contradicts (5) and therefore, $\underline{f}_s(x) = \hat{f}_s(x)$. □ 
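
The statement of Lemma 3.1 can be checked numerically on small random tables. This is only a sanity check under stated assumptions, not a substitute for the proof: tables are lists with entry 0 set to infinity, integer values are used so that comparisons are exact, and the names `aggregate` and `lemma_holds` are hypothetical.

```python
import math
import random

def reduce_table(f):
    """Running minimum of a response-time table."""
    out, best = [], math.inf
    for v in f:
        best = min(best, v)
        out.append(best)
    return out

def aggregate(f1, f2, H):
    """f_s(x) = min over 0 < y < x of H(f1(y), f2(x - y)); infinity when no split exists."""
    p = len(f1) - 1
    fs = [math.inf] * (p + 1)
    for x in range(2, p + 1):
        fs[x] = min(H(f1[y], f2[x - y]) for y in range(1, x))
    return fs

def lemma_holds(f1, f2, H):
    """Reducing the aggregate equals aggregating the reduced tables."""
    lhs = reduce_table(aggregate(f1, f2, H))
    rhs = aggregate(reduce_table(f1), reduce_table(f2), H)
    return lhs == rhs

random.seed(0)
p = 6
f1 = [math.inf] + [random.randint(1, 20) for _ in range(p)]
f2 = [math.inf] + [random.randint(1, 20) for _ in range(p)]
series_ok = lemma_holds(f1, f2, lambda a, b: a + b)   # H = sum (series tasks)
parallel_ok = lemma_holds(f1, f2, max)                # H = max (parallel tasks)
```

Both choices of $H$ used in the paper, sum and maximum, can be tried by passing a different function as the third argument.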


Thus, we have shown that no information is lost in reduction, since the desired optimal response 
time function of the aggregate f s is obtained using the reduced response time functions of the 
constituent tasks. This is an important point: we will build up response-time functions for complex 
tasks using increasing functions H, and minimization equations of the form shown in equation (2). 
We have just shown that if we start with reduced response time functions, then we will construct 
reduced response time functions, and the assignments associated with them will be irreducible. 

The lemma can be generalized through an easy induction argument for multiple, complex tasks. 


Lemma 3.2 Let $s_1, \dots, s_k$ be $k$ complex tasks with optimal response time functions $g_1, \dots, g_k$, whose 
reduced forms are $\underline{g}_1, \dots, \underline{g}_k$, and let $H(x_1, \dots, x_k)$ be an increasing function in each argument. If $s$ is the task that represents the 
aggregate of tasks $s_1, \dots, s_k$ with reduced optimal response time function $\underline{h}(x)$, and defining 

$$\hat{h}(x) = \min_{\substack{y_1, \dots, y_k \in [1, p] \\ y_1 + \dots + y_k = x}} \{ H(\underline{g}_1(y_1), \dots, \underline{g}_k(y_k)) \},$$

then $\underline{h}(x) = \hat{h}(x)$. 

Remark 3.1 If the irreducible minimums of the functions $g_1, \dots, g_k$ occur at $x_1, \dots, x_k$, then the 
irreducible minimum of $h$, $x_0$, satisfies $x_0 \le \sum_{i=1}^{k} x_i$. 

The last remark implies that when constructing $h$ we may restrict our attention to only those 
assignment vectors $(y_1, \dots, y_k)$ for which $\sum_{i=1}^{k} y_i \le \sum_{i=1}^{k} x_i$. This will result in improved execution 
time for our optimization algorithms when $\sum_{i=1}^{k} x_i < O(p)$. Next, we begin our presentation of the 
algorithms by first treating the two simpler task structures, linear series tasks and linear parallel 
tasks. 

4 Linear Task Structures 

Linear task structures are interesting both because many pipelines are simple linear chains [25] and 
because chains appear as tasks in more complex task structures. We examine two different ways of 
assessing the cost of a linear chain. The first is when the chain is a linear pipeline, and the response 
time function is the sum of the response times of each of the ‘stages’ [25]. This is called a series 
task structure. The second is when the constituent tasks execute in parallel on different aspects 
of the same data set, a parallel task structure. For both problems we show how to construct the 
optimal response time function for the aggregate task, and, for every $q = 1, \dots, p$, how to recover 
the optimal assignment of q processors from information computed as the response time function 
was constructed. 

In the treatments of both problems we consider $s_1, \dots, s_m$ to be the set of $m$ constituent tasks, 
and $g_1, \dots, g_m$ to be their respective response-time functions. Let $s$ be the aggregate task whose 
optimal response time function $h(x)$, $0 \le x \le p$, we are interested in computing. Note that each 
constituent task $s_j$ may already be an aggregation of the elemental tasks $t_i$. Our immediate goal is 
to construct the overall reduced response time function for processors in the range $[1, p]$ and also 
to recover the optimal assignment when required. 

4.1 Series Tasks 

First we describe an algorithm that constructs the optimal response time function $h(x)$ for linear 
task structures when each function $g_i(x)$ is convex (see [30], pp. 445-454) in $x$, i.e., when the 
efficiency of parallelism is decreasing (see p. 217 in [16] for an example). We later treat the 
general case. 

Let the assignment be recorded in $I(s, x) = (x_1, \dots, x_m)$ where $x_j$ denotes the number of 
processors assigned to task $s_j$; also let $h_G$ denote the response time function created by our algorithm. 
As a first step, we must ensure that every task $s_i$ is allocated enough processors $u_\lambda(s_i)$ to meet 
the throughput constraint. For each $i = 1, \dots, m$, let $x_i = u_\lambda(s_i)$ be this initial assignment. Of 
course, the algorithm terminates at this point if $\sum_{i=1}^{m} x_i > p$, because no feasible assignment exists. 
Note that this first step does not require the presumed convexity of each $g_i$. Let $t = \sum_{i=1}^{m} x_i$; 
we set $h_G(x) = \infty$ for all $x < t$ to reflect an inability to meet the throughput requirement, set 
$h_G(t) = \sum_{i=1}^{m} g_i(x_i)$, and let $x = t + 1$. Next, for each $s_i$, compute $d(i, x_i) = g_i(x_i + 1) - g_i(x_i)$, the 
change in response time achieved by allocating one more processor to $s_i$. Build a max-priority heap 
[20] where the priority of $s_i$ is $|d(i, x_i)|$. Finally, enter a loop where, on each iteration, 

• The task (say $s_j$) with highest priority is allocated another processor. 

• Let $a$ denote the number of processors previously assigned to $s_j$. Compute $h_G(x) = h_G(x - 1) + d(j, a)$, 
and set $I(s, x) = (x_1, \dots, x_j + 1, \dots, x_m)$. 

• Increment $x$. 

• Compute $s_j$'s new priority, and adjust the priority heap accordingly. 
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\\r e iterate until all available processors have been assigned, or the top element of the heap is non- 
negative, i.e., d(j,Xj) is non-negative. If the top element becomes non-negative when x — y, then 

we assign h G (z) = h G (y — 1) and I(s, z ) = I(s,y — 1) for all z — y, . . ■ ,p. 

Each iteration of the loop allocates the next processor to the task which stands to benefit most 
from the allocation. When the individual task response functions are convex, then the greedy 
response time function h G it produces is optimal, and is irreducible. 
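
The greedy procedure described above can be sketched with a binary heap. The sketch makes several assumptions that are not in the text: tables are lists indexed by processor count, the minimum requirements $u_\lambda(s_i)$ are supplied as the list `u`, and a min-heap keyed on the signed change $d$ replaces the max-heap on $|d|$ (the two orderings agree while every $d$ is negative). The function name and the sample tables are hypothetical.

```python
import heapq
import math

def greedy_series(g, u, p):
    """g[i][k]: convex response time of task i on k processors (g[i][0] unused).
    u[i]: minimum processors for task i under the throughput constraint.
    Returns (h, alloc): h[x] is the greedy response time with x processors,
    alloc[x] the corresponding assignment tuple (None where infeasible)."""
    m = len(g)
    x_i = list(u)
    t = sum(x_i)
    h = [math.inf] * (p + 1)
    alloc = [None] * (p + 1)
    if t > p:
        return h, alloc                       # no feasible assignment exists
    h[t] = sum(g[i][x_i[i]] for i in range(m))
    alloc[t] = tuple(x_i)

    def delta(i):
        """d(i, x_i): change in response time from one more processor."""
        k = x_i[i]
        return g[i][k + 1] - g[i][k] if k + 1 < len(g[i]) else math.inf

    heap = [(delta(i), i) for i in range(m)]  # most negative change on top
    heapq.heapify(heap)
    x = t + 1
    while x <= p and heap[0][0] < 0:
        d, j = heapq.heappop(heap)
        x_i[j] += 1
        h[x] = h[x - 1] + d
        alloc[x] = tuple(x_i)
        heapq.heappush(heap, (delta(j), j))
        x += 1
    for z in range(x, p + 1):                 # extra processors bring no benefit
        h[z] = h[x - 1]
        alloc[z] = alloc[x - 1]
    return h, alloc

# Hypothetical convex tables for two tasks.
g = [[math.inf, 10, 6, 4, 3, 3], [math.inf, 8, 5, 3, 2, 2]]
h, alloc = greedy_series(g, [1, 1], 6)
```

For these tables the six-processor optimum splits the processors evenly, and calling the function with a larger `p` shows the table flattening once no task benefits from another processor.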

Prop. 4.1 Suppose that $g_i(x)$ is convex over $x \in [1, p]$, for all $i = 1, \dots, m$. Then for all $x \in [1, p]$, 
$h_G(x) = h(x)$, the optimal response time function. Furthermore, $h_G(x)$ is irreducible. 

Proof: Clearly, each task $s_i$ must receive at least $u_\lambda(s_i)$ processors in order for the throughput 
condition to be satisfied. Recalling that $t = \sum_{i=1}^{m} u_\lambda(s_i)$, it is clear that $h_G(x) = h(x) = \infty$ 
for all $x \in [1, t - 1]$. Now consider $x = t$. For all $j \ge 1$ the remainder of the algorithm 
should assign “the next” $j$ processors in such a way to obtain the maximal possible decrease 
in response time given $j$ additional processors. The proposed algorithm does exactly that. 
$D = \{ d(i, x_i + j) \mid 1 \le i \le m, 1 \le j \le p - x \}$ is the set of all possible changes for the remainder of 
the assignment. For every $j = 1, \dots, p - t$, the maximal decrease is obtained by choosing the 
$j$ largest (in magnitude) elements of $D$. Since each $g_i$ is convex, $|d(i, x_i + j_1)| \le |d(i, x_i + j_2)|$ 
for $j_1 > j_2$ (see [30], pp. 453-454) and so the $j$ elements with largest magnitude in $D$ are 
selected as given in the algorithm. 

The irreducibility of $h_G$ follows from its construction. 


The complexity of this algorithm is low. The throughput condition is checked in $m$ steps. 
The initial priority heap is constructed in $O(m \log m)$ time; the highest priority heap element is 
found in $O(1)$ time and each heap adjustment requires only $O(\log m)$ time using standard heap 
algorithms. Thus the overall complexity is $O(m \log m) + O(p \log m) = O(p \log m)$, since $m \le p$ whenever 
a feasible assignment exists. This is an 
example of how the structure of the response time function (convexity) can be used to obtain 
higher algorithmic efficiency than might otherwise be achievable, as we will see below for general 
response time functions. 

A different approach, based on dynamic programming, is needed when the task response time 
functions are not convex. In fact, we anticipate that this condition will be the norm when con- 
sidering chains whose tasks are themselves aggregates of other tasks. Since convexity need not be 
preserved in aggregation, we must turn to a slightly more complicated algorithm. The new approach 
has a higher complexity, $O(mp^2)$, but it permits completely general response time functions. We 
will show that certain algorithmic efficiencies are possible when bounds on the least minimums are 
known ahead of time. 

For any $j = 1, \dots, m$, we can view the subchain $s_1, \dots, s_j$ as a (larger) task itself. We will call 
this task $S_j$, and compute its optimal response time function: for $x = 1, \dots, p$ let $G_\lambda(j, x)$ be the 
minimal response time of $S_j$, subject to throughput constraint $\lambda$, achievable when no more than 
$x$ processors are allocated to it. The function $G_\lambda(j, \cdot)$ is thus $S_j$'s optimal response time function; 
in computing this function we will simultaneously check the throughput constraint (hence the 
subscript $\lambda$). Using the principle of optimality [12], we may write a recursive definition for $G_\lambda(j, x)$ 
as follows. 


$$G_\lambda(j, x) = \begin{cases} \infty & \text{if } u_\lambda(s_j) + u_\lambda(S_{j-1}) > x \\ g_1(x) & \text{if } j = 1 \text{ and } u_\lambda(s_1) \le x \\ \min_{u_\lambda(s_j) \le i \le x - u_\lambda(S_{j-1})} \{ g_j(i) + G_\lambda(j - 1, x - i) \} & \text{otherwise.} \end{cases} \qquad (6)$$

These equations define response time to be $\infty$ whenever insufficiently many processors are allocated 
to $s_j$ or $S_{j-1}$ to meet the throughput constraint; we define $u_\lambda(S_0) = 0$ as a boundary condition. 

Observe that $h(x) = G_\lambda(m, x)$. Note that the $H$ function (Lemma 3.2) is the ‘sum’ operator here, 
in the third part of the equation. 

The dynamic programming equation is more intuitively explained by reading it ‘top down’. 
Suppose we had somehow computed the response time table for the first $j - 1$ tasks (the ‘large’ 
task $S_{j-1}$), i.e., $G_\lambda(j - 1, \cdot)$. Then, given $x$ processors to distribute between tasks $s_j$ and $S_{j-1}$, we try 
every combination subject to the throughput constraints: $i$ processors for $s_j$ and $x - i$ processors for 
$S_{j-1}$. Since the equation is written as a recursion, the computation will actually build response time 
tables for larger tasks ‘bottom up’, starting with task $S_1$ in the second part of the equation. Note 
that similar explanations may be given for the dynamic programming equations that appear later 
in the paper. The optimal assignment of $q$ ($1 \le q \le p$) processors to tasks is found by setting the 
appropriate value of $I$ as we solve for the value $G_\lambda(j, x)$. Suppose that $i$ solves 
$G_\lambda(j, x) = g_j(i) + G_\lambda(j - 1, x - i)$. Then we set $I(S_j, x) = (x_1, \dots, x_{j-1}, i)$, where $I(S_{j-1}, x - i) = (x_1, \dots, x_{j-1})$. 

An important consequence of Lemma 3.2 is that each function $G_\lambda(j, \cdot)$ (and hence each
assignment $I(S_j, x)$) is irreducible. This follows directly from the fact that equation (6) has
the form specified by equation (3). The more complex bounds on the minimum's index variable in
equation (6) serve simply to keep the index $i$ away from regions where either $g_j(\cdot)$ or
$G_\lambda(j-1, \cdot)$ are known to take value $\infty$.

If we have already solved for the minimal response time function $G_\lambda(j-1, \cdot)$, we may use
equation (6) to determine $G_\lambda(j, \cdot)$. The cost of determining one individual
$G_\lambda(j, x)$ value is seen to be $O(x) = O(p)$; the cost of determining the whole function
$G_\lambda(j, \cdot)$ is thus $O(p^2)$, and the cost of determining all such functions (and hence
the desired response time function $G_\lambda(m, \cdot)$) is $O(mp^2)$.
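
As a minimal sketch of equation (6), the Python fragment below fills the table $G_\lambda(j, x)$ bottom up and recovers an assignment by backtracking. It is illustrative only: the function names, the list layout `g[j][i]`, the 0-based task indices and the sample response times are assumptions, not taken from the paper. It also assumes that each $g_j$ is non-increasing, so that a task meets throughput $\lambda$ exactly when $g_j(i) \le 1/\lambda$; infeasible splits are simply scored as infinite rather than excluded through the index bounds.

```python
import math

def series_tables(g, p, lam):
    """Fill G[j][x] for a chain of tasks, following equation (6).

    g[j][i] is the response time of task j on i processors (index 0 unused).
    Returns (G, split); split[j][x] is the processor count given to task j.
    """
    limit = 1.0 / lam  # assumption: a task meets the constraint iff g <= 1/lam

    def cost(j, i):
        return g[j][i] if g[j][i] <= limit else math.inf

    m = len(g)
    G = [[math.inf] * (p + 1) for _ in range(m)]
    split = [[0] * (p + 1) for _ in range(m)]
    for x in range(1, p + 1):          # second case of (6): the chain is one task
        G[0][x] = cost(0, x)
        split[0][x] = x
    for j in range(1, m):              # third case: try every split i / x - i
        for x in range(2, p + 1):
            for i in range(1, x):
                c = cost(j, i) + G[j - 1][x - i]
                if c < G[j][x]:
                    G[j][x], split[j][x] = c, i
    return G, split

def recover_assignment(G, split, x):
    """Backtrack through split to list the processors given to each task."""
    if math.isinf(G[-1][x]):
        return None                    # the throughput constraint cannot be met
    alloc = []
    for j in range(len(split) - 1, -1, -1):
        alloc.append(split[j][x])
        x -= split[j][x]
    return alloc[::-1]

# Hypothetical two-task chain on p = 5 processors with lam = 1/8.
g = [[None, 10, 6, 4, 3, 3],
     [None, 9, 5, 4, 4, 4]]
G, split = series_tables(g, 5, 0.125)
print(G[1][5], recover_assignment(G, split, 5))  # 9 [3, 2]
```

The three nested loops make the $O(mp^2)$ cost visible: $m$ tasks, $p$ table entries per task, and up to $p$ candidate splits per entry.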

The application of the above dynamic programming procedure, in equation (6), is illustrated
in Figure 1 (which shows the computation of $G_\lambda(j, \cdot)$) for a task structure with three
tasks. The response time functions, $g_i(x)$, for the three tasks $t_1$, $t_2$ and $t_3$ are taken
from Table 1 and the throughput constraint is $\lambda = 1/40$. Since we use tasks from Table 1, we
revert to using $t_i$ for the constituent tasks. The first column of the table identifies the
aggregated task $S_j$, for $1 \le j \le 3$; here $S_1 = t_1$, $S_2 = (t_1, t_2)$ and
$S_3 = (t_1, t_2, t_3)$. A row $j$ corresponds to the response time function $G_\lambda(j, x)$ for
aggregated task $S_j$; entry $[k, l]$ in the table (row $k$, column $l$) gives the value, and
the corresponding assignment, for $G_\lambda(k, l)$. The last row shows the assignment produced by
the algorithm; this assigns 3 processors to each of tasks $t_1$ and $t_2$ and 2 processors to $t_3$
with minimum response time of 30.5 and an achieved throughput of 1/14. Note that in our example
above, and in all other examples to follow, we have omitted the dummy task that is the last task
executed on the data set, since it plays no role in the computation.

Figure 1: Application of Algorithm for series tasks: $G_\lambda(j, x)$ for $1 \le j \le 3$, $1 \le x \le 8$

The dynamic programming equations can sometimes be solved more efficiently, when each $g_i$ has
an irreducible minimum at $z_i$, and each $z_i$ is small relative to $p$. Suppose $z_i \le L$ for
all $i = 1, \ldots, m$. We next show how the optimality equations can be solved in $O(m^2 L^2)$
time. This is advantageous when $L < O(p/\sqrt{m})$.

As we solve for each $G_\lambda(j, x)$, Remark 3.1 also tells us that we need not consider
assigning any more than $z_j \le L$ processors to $s_j$. This means we can rewrite the optimality
equations as

$$
G_\lambda(j, x) =
\begin{cases}
\infty & \text{if } u_\lambda(s_j) + u_\lambda(S_{j-1}) > x \\
g_1(x) & \text{if } j = 1 \text{ and } u_\lambda(s_1) \le x \\
\min\limits_{\max\{u_\lambda(s_j),\; x - \sum_{k=1}^{j-1} z_k\} \le i \le z_j} \{ g_j(i) + G_\lambda(j-1, x-i) \} & \text{otherwise.}
\end{cases}
\qquad (7)
$$

The complex lower bound on $i$ excludes values for which $s_j$ cannot meet the throughput
constraint, as well as values for which $G_\lambda(j-1, x-i)$ would be indexed beyond $S_{j-1}$'s
known minimum; the upper bound keeps $i$ from going beyond $s_j$'s known minimum. Thus, the cost of
computing $G_\lambda(j, x)$ is only $O(L)$. Since we need only compute $G_\lambda(j, x)$ for
$x \le \sum_{i=1}^{j} z_i \le jL$, the cost of computing $G_\lambda(j, \cdot)$ is $O(jL^2)$, so that
the cost of solving the overall problem is $O(\sum_{j=2}^{m} jL^2) = O(m^2 L^2)$.
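
The arithmetic behind the overall bound, and the point at which it beats the general $O(mp^2)$ algorithm, can be written out as a short worked step. This only expands the sums already stated in the text, with $L$ the assumed bound on every irreducible minimum:

```latex
\sum_{j=2}^{m} jL^2 \;=\; L^2\left(\frac{m(m+1)}{2} - 1\right) \;=\; O(m^2 L^2),
\qquad
m^2 L^2 < m p^2 \;\Longleftrightarrow\; L < \frac{p}{\sqrt{m}} .
```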


4.2 Parallel Tasks 

In this subproblem, we have a sequence $S$ of tasks $s_1, \ldots, s_m$ with irreducible
response-time functions $g_1, \ldots, g_m$ for which we need to determine the irreducible optimal
response-time function $h(x)$ for the maximum where

$$
h(x) = \min_{\substack{x_1, \ldots, x_m \\ x_1 + \cdots + x_m = x}} \max \{ g_1(x_1), g_2(x_2), \ldots, g_m(x_m) \}.
$$

In this case, the function $H$ (in Lemma 3.2) is the maximum operator. The basic idea behind the
algorithm is that after processors are allocated to meet the throughput requirement, we can only
drive the maximum response time down by allocating a processor to the task whose response time
under the present allocation is maximal. This process is repeated until the maximum number of
needed processors is allocated. This idea is now made more precise.


Suppose that the irreducible minimum of each $g_i$ occurs at $z_i$, and let
$z_h = \sum_{i=1}^{m} z_i$. First, observe that the response time function value at all processor
counts smaller than $t = \sum_{i=1}^{m} u_\lambda(s_i)$ is $\infty$. Thus, for $i = 1, \ldots, m$,
we begin by assigning $u_\lambda(s_i)$ processors to task $s_i$. This is also reflected in the
initialization of the data structure recording assignments, as
$I(S, t) = (u_\lambda(s_1), \ldots, u_\lambda(s_m))$. Set $h(x) = \infty$ for
$x = 1, \ldots, t-1$, and $h(t) = \max_{1 \le i \le m} \{ g_i(u_\lambda(s_i)) \}$. Next build a
max-priority heap on the tasks, where $g_i(u_\lambda(s_i))$ is the priority for task $s_i$. Let
$x = t + 1$, and enter a loop where the following is performed for at most $z_h - t$ iterations.


• Give an additional processor to the task whose priority is greatest. Let $y_x$ be that maximal
priority.

• If that task (say $s_i$) was previously assigned $x_i$ processors, and if $x_i = z_i$, then
terminate the algorithm.

• If that task (say $s_i$) was previously assigned $x_i < z_i$ processors, reset its new priority
to $g_i(x_i + 1)$. Set $I(S, x) = (x_1, \ldots, x_i + 1, \ldots, x_m)$, where
$I(S, x-1) = (x_1, \ldots, x_i, \ldots, x_m)$.

• Adjust the max-priority heap to reflect the task's new priority, and set $h(x)$ to the maximum
value in the heap.

• Increment $x$.


If the loop terminates with $x = y$, then set $h(z) = h(y-1)$ and $I(S, z) = I(S, y-1)$ for all
$z = y, \ldots, p$.

The termination condition follows from the observation that if $s_i$ has the maximum response
time but already has $z_i$ processors assigned, no further assignment of processors to $s_i$ can
reduce its response time. Since the objective function is the maximum response time among tasks,
that objective function cannot be further reduced. It is clear then that the procedure we describe
constructs an irreducible function. The algorithm's correctness is established with the following
lemma.

Lemma 4.1 For every x = t,...,p, h( x) = h(x) - y x . 
Proof: For every $i = 1, \ldots, m$, let $S_i = \{ g_i(k) \mid k = u_\lambda(s_i), \ldots, z_i \}$
be the set of response times for $s_i$ following its initial assignment, and let
$S = \bigcup_{i=1}^{m} S_i$. Since the objective function value for an assignment is the maximum
response time under that assignment and since we stop assigning processors once the objective
function can no longer be minimized, $S$ contains every value of $y_x$ generated by our algorithm.
Furthermore, the sequence $y_t, y_{t+1}, \ldots$ describes the elements of $S$ in descending order.
Now if an assignment is to achieve cost $y_x$, the response time of every task must be no greater
than $y_x$. We argue that our algorithm finds an assignment achieving cost $y_x$, using the minimum
number of processors. For every $y_i$, let $T(y_i)$ be the task from whose response-time function
$y_i$ is taken. Our algorithm allocates an additional processor to $T(y_1)$, then another to
$T(y_2)$, and so on. For every $x = t, \ldots, z_h$ and $j = 1, \ldots, m$ let $P_j(x)$ be the
number of elements $y_a$ with $a < x$ for which $T(y_a) = s_j$. $P_j(x)$ is thus the number of
additional processors our algorithm has allocated to $s_j$ by the $(x - t)$th pass through the
loop, and is also the minimum number of additional processors (after $u_\lambda(s_j)$) that $s_j$
must be assigned if its response is to be no greater than $y_x$. As this is true for every task for
every $y_x$, it follows that the assignment generated by our algorithm achieves each cost $y_x$
with the minimum number of processors. The lemma's conclusion is a restatement of this fact.


Since the algorithm's loop is executed at most $z_h - t$ times, the overall cost of the algorithm
is $O(m \log m + z_h \log m)$. The optimal assignment is found in $I(S, p)$. An example of the
application of this algorithm is shown in the next section; in Figure 2 the row for $B_1$ shows the
response time function (and the corresponding assignment) of a parallel task composed of tasks
$t_1$ and $t_2$.
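
The heap-driven loop described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function name, the argument layout (`g[i][k]`, `u[i]`, `z[i]`) and the sample data are hypothetical, and because Python's `heapq` is a min-heap the priorities are stored negated.

```python
import heapq

def parallel_tables(g, u, z, p):
    """Greedy construction of h(x) = min over splits of max_i g[i][x_i].

    g[i][k]: response time of task i on k processors (index 0 unused);
    u[i]: fewest processors with which task i meets the throughput constraint;
    z[i]: processor count at which g[i] reaches its irreducible minimum.
    Returns dicts h and I keyed by processor count x (only feasible x appear).
    """
    alloc = list(u)
    t = sum(alloc)
    h, I = {}, {}
    if t > p:
        return h, I                    # not enough processors to be feasible
    # heapq is a min-heap, so priorities are stored negated.
    heap = [(-g[i][alloc[i]], i) for i in range(len(g))]
    heapq.heapify(heap)
    h[t], I[t] = -heap[0][0], tuple(alloc)
    x = t + 1
    while x <= p:
        _, i = heap[0]                 # task with the largest response time
        if alloc[i] == z[i]:
            break                      # the bottleneck task cannot be improved
        alloc[i] += 1
        heapq.heapreplace(heap, (-g[i][alloc[i]], i))
        h[x], I[x] = -heap[0][0], tuple(alloc)
        x += 1
    for y in range(x, p + 1):          # further processors bring no improvement
        h[y], I[y] = h[x - 1], I[x - 1]
    return h, I

# Hypothetical two-task example with p = 8 processors.
g = [[None, 20, 12, 8, 8], [None, 15, 10, 9, 9]]
h, I = parallel_tables(g, u=[1, 1], z=[3, 3], p=8)
print(h[5], I[5], h[8], I[8])  # 10 (3, 2) 9 (3, 3)
```

Each pass through the loop performs one heap adjustment, which is where the $\log m$ factor in the stated cost comes from.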

While the problems studied in this paper are distinctly different from those addressed in the
literature, a closer look reveals that the above algorithm (for parallel tasks) is a generalization
of the algorithm independently conceived in [27]. While they address the problem of finding a
nonpreemptive schedule for a set of $n$ independent tasks, i.e., parallel tasks, their algorithm in
fact finds an assignment which satisfies the feasibility conditions of our problem. Our algorithm
is a generalization in the sense that they do not "construct" a reduced response time table for the
entire parallel task that provides the response time as a function of the number of processors.
This is essential for our solution technique, which views complex task structures as compositions
of simpler task structures.


5 Complex Tasks 

The algorithms we have developed to analyze series and parallel task structures can be used to
analyze task structures whose graphs form trees, or series-parallel graphs. We now show how the
response time function for a tree task with $n$ nodes and arbitrary branching is computed in
$O(np^2)$ time, and how a series-parallel task with arbitrary branching is analyzed in $O(np^2)$
time. Note that the complex tasks we consider usually determine a whole pipeline computation and
thus, we will henceforth use $n$ (as in Section 2) to denote the number of nodes in the task graph.
Series-parallel graphs arise frequently in applications where data in a set is split, processed
separately, and then rejoined. The basic idea behind our algorithms is that these complex
structures can be viewed as a composition of series and parallel tasks, thus facilitating the use
of the algorithms designed thus far.

5.1 Tree Tasks 

Suppose the precedence graph for $\mathcal{P}$ forms a tree with $n$ nodes. Either out-trees (edges
directed to child nodes) or in-trees (edges directed to parent node) are permissible. Without loss
of generality (because path lengths are unaffected by arc direction) our discussion will concern
out-trees.

For notational convenience we assume that every non-leaf node has exactly $b$ children; our
approach extends immediately to the general case. For every task $s_j$, let
$c_{j,1}, \ldots, c_{j,b}$ be $s_j$'s children. $s_j$ is the root of a subtree which can be viewed
as a subtask $T_j$ with its own response time function. Dynamic programming again expresses the
optimal response time function for each $T_j$. The optimal response time function for $T_1$ is the
overall problem solution.

Let $G_\lambda(j, x)$ be the optimal response time achievable by $T_j$ when subject to throughput
constraint $\lambda$. Let $\mathcal{I}$ be the set of interior tree tasks, and $\mathcal{L}$ be the
set of leaf tasks. The principle of optimality states that

$$
G_\lambda(j, x) =
\begin{cases}
\infty & \text{if } s_j \in \mathcal{L} \text{ and } u_\lambda(s_j) > x \\
\min\limits_{\substack{x_0, \ldots, x_b \\ x_0 + \cdots + x_b = x}} \{ f_j(x_0) + \max\limits_{1 \le i \le b} \{ G_\lambda(c_{j,i}, x_i) \} \} & \text{otherwise.}
\end{cases}
$$

The formidable recursive expression simply takes the minimum cost over all possible partitionings
of $x$ processors among $s_j$ and the $b$ subtrees rooted in its children. Fortunately, the results
developed in Section 4 may be employed to solve this equation efficiently. The subtasks $c_{j,1}$
through $c_{j,b}$ form a single parallel task, $B$. The algorithm developed in the previous section
constructs $B$'s irreducible response time function in $O(p \log b)$ time. Next we can view $T_j$
as a series task, composed of $s_j$ and $B$. Given $B$'s response time function, $T_j$'s
irreducible response time function is computed in $O(p^2)$ additional time using the algorithm
described in Section 4.1. Thus, the cost of computing the serial composition dominates. The
complexity of computing response time functions for all $T_j$ where $s_j \in \mathcal{I}$ is
$O(\sum_{s_j \in \mathcal{I}} p^2)$. Note however that $b|\mathcal{I}| = n$, which implies that the
total cost of processing interior tasks is $O(np^2/b)$. Since the cost of processing all leaf tasks
is $O(n)$, the total cost in the general case is $O(np^2/b)$.

The procedure is illustrated by the example in Figure 2, a tree with 5 constituent tasks; here
$\lambda = 1/40$. The tasks $t_1, t_2$ form a parallel task, denoted $B_1$; $B_1$ and $t_3$ form a
series task, denoted $T_3$. Similarly, the aggregate task $T_3$ and $t_4$ form a parallel task
$B_2$; $B_2$ and $t_5$ form a series task $T_5$ whose response time gives us the response time of
the entire task. Note that the tasks are taken from Table 1. Each row of the table shows the
response time and assignment for the corresponding aggregated task. The minimum response time
achieved by the assignment is 41 (by assigning 2 processors to $t_1$, 3 to $t_2$ and one processor
to each of the other three tasks) and the achieved throughput is 1/20.

task    aggregates      x = 5               x = 6               x = 7               x = 8
$B_1$   ($t_1$, $t_2$)  16   (2,3)          14   (3,3)          11   (3,4)          11   (4,4)
$T_3$   ($t_3$, $B_1$)  31   (2,2,1)        26   (2,3,1)        21.5 (2,3,2)        19.5 (2,3,3)
$B_2$   ($t_4$, $T_3$)  39   (1,2,1,1)      31   (2,2,1,1)      26   (2,3,1,1)      21.5 (2,3,2,1)
$T_5$   ($t_5$, $B_2$)  65   (1,1,1,1,1)    54   (1,2,1,1,1)    46   (2,2,1,1,1)    41   (2,3,1,1,1)

Figure 2: Application of Algorithm for Tree Structures

Better complexities are achievable when the irreducible minima $z_i$ for each $s_i$ satisfy
$z_i \le L$ where $L < p$. The computation of $B$'s response time function is fast: $O(bL \log b)$
time. For $s_j + B$, let $z_{T_j}$ be the sum of the $z_i$ values for all nodes in the subtree
rooted in $s_j$. Since we need not consider any assignment that gives more than $z_j$ processors to
$s_j$, the response time function for $s_j + B$ is computed in $O(z_{T_j} L)$ time. This cost
dominates that of computing $B$'s response time function, provided that $b \log b < L$, which we
will assume here for simplicity.

The total cost of analyzing the tree is maximized when each $z_{T_j}$ is as large as possible. This
occurs when the tree is actually just a linear chain, in which case $z_{T_n} \le L$,
$z_{T_{n-1}} \le 2L$, $z_{T_{n-2}} \le 3L$, and so on. As we have seen, the total cost is then
$O(n^2 L^2)$. The best topology is a full tree; for example, consider a full binary tree. A subtree
$T_j$ consisting of exactly 3 tasks has $z_{T_j} \le 3L$, and an analysis cost of $O(3L^2)$; $n/2$
such subtrees are analyzed. Then, $n/4$ subtrees are analyzed where
$z_{T_j} \le L + 3L + 3L = 7L$. Each of these requires $O(7L^2)$ time to analyze. Continuing in
this fashion we determine a complexity bound of

$$
O\left( \sum_{i=1}^{\log n} \frac{n}{2^i} \, (2^{i+1} - 1) \, L^2 \right) = O(L^2 n \log n).
$$
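
The closed form can be checked term by term. As a short worked step (using only the level counts stated in the text, with $L$ the bound on the irreducible minima), each of the $\log n$ levels contributes less than $2nL^2$:

```latex
\sum_{i=1}^{\log n} \frac{n}{2^i}\,(2^{i+1}-1)\,L^2
  \;<\; \sum_{i=1}^{\log n} \frac{n}{2^i}\cdot 2^{i+1}\,L^2
  \;=\; \sum_{i=1}^{\log n} 2nL^2
  \;=\; 2nL^2 \log n
  \;=\; O(L^2 n \log n).
```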

5.2 Series-Parallel Tasks 

Finally, we consider series-parallel task graphs. We show that the response time function for such
a graph (with $n$ nodes) can be computed in $O(np^2)$ time. A number of different but equivalent
definitions of series-parallel graphs exist. The one we will use is taken from [42], which studies
vertex series-parallel DAGs. However, based on their results on the equivalence of edge
series-parallel DAGs and vertex series-parallel DAGs, we use the term series-parallel to mean both
cases and use their definition of vertex series-parallel DAGs. A series-parallel DAG (SP) is
defined recursively as follows.

(i) The DAG having a single vertex and no edges is SP.

(ii) If $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are two SP DAGs, so are the DAGs constructed by
each of the following two operations:

(a) Parallel composition: $G_p = (V_1 \cup V_2, E_1 \cup E_2)$.

(b) Series composition: $G_s = (V_1 \cup V_2, E_1 \cup E_2 \cup (T_1 \times S_2))$, where $T_1$ is
the set of sinks of $G_1$ and $S_2$ is the set of sources of $G_2$.

A node $t_i$ in $G = (V, E)$ is a sink if there are no outgoing edges from $t_i$, i.e., there is no
edge $(t_i, t_j)$ in $E$. A node $t_i$ is a source if there are no incoming edges to the node,
i.e., there is no edge $(t_j, t_i)$ in $E$. It is shown in [42] that any SP DAG can be parsed as a
binary decomposition tree (BDT). Figure 3 illustrates a series-parallel graph, and the BDT that
represents the graph. The internal nodes are labeled $S_i$ or $P_i$, to denote the series or
parallel composition. There is a one-to-one correspondence between BDT leaves and DAG nodes. Each
internal BDT node $a$ represents either a series (labeled S) or parallel (labeled P) composition of
two SP subgraphs represented by the subtrees rooted in $a$. For example, suppose $a$'s subtrees are
simply leaf nodes. The corresponding nodes in the DAG are SP graphs, composed by the operation
specified in $a$'s label; $a$ can be thought to be representing that composition. Now if $a$'s BDT
parent is some node $q$ and $q$ has another child $a'$, then we know that $a'$ represents an SP
subgraph of the original DAG, and $q$ represents the series or parallel composition of the
subgraphs represented by $a$ and by $a'$. A BDT thus shows the selection and ordering of
compositions necessary to establish that the original DAG is SP with respect to the definition
above.

There is an obvious correspondence between SP compositions and the methods we have developed to
compute response time functions for series and parallel task structures. If we think of an SP
DAG's nodes as representing tasks, a series composition corresponds to the aggregation of two
tasks into a series task structure: two tasks are replaced by one, and the serial edge between them
disappears. Similarly, a parallel composition corresponds to the aggregation of a set of tasks into
a parallel task structure. It is thus quite straightforward to construct the response time function
for a series-parallel graph, once the associated BDT is known. Starting with the individual tasks'
response time functions, we compose response-time functions in the order specified by the BDT. The
response time functions created during intermediate steps represent aggregate subtasks in much
the same way as task $T_j$ represented an entire subtree in Section 5.1. Likewise, the optimal
assignment is recovered by backtracking through intermediate optimal assignments in the same way
as was described for trees.

(a) A series-parallel graph
(b) Binary decomposition tree
Figure 3: A Series-Parallel Graph and corresponding BDT

                                     Number of processors
Task aggregates                  5              6              7              8
$P_1$         response time      16             14             11             11
 parallel: ($t_1$, $t_2$)        (2,3)          (3,3)          (3,4)          (4,4)
$S_1$         response time      31             26             21.5           19.4
 serial: ($P_1$, $t_3$)          (2,2,1)        (2,3,1)        (2,3,2)        (2,3,3)
$P_2$         response time      10             10             9              8
 parallel: ($t_4$, $t_5$)        (3,2)          (3,3)          (4,3)          (5,3)
$G = S_2$     response time      70             59             51             46
 serial: ($S_1$, $P_2$)          (1,1,1,1,1)    (1,2,1,1,1)    (2,2,1,1,1)    (2,3,1,1,1)

Table 2: Computation of Response times for series-parallel structures

An application of our procedure, for the series-parallel graph in Figure 3, is shown in Table 2 for
throughput constraint $\lambda = 1/40$. Each row shows the response time function, and
corresponding assignment, for the aggregate task formed by a series or parallel composition. For
example, the row labeled $S_1$ corresponds to the aggregate task formed by the series composition
of $P_1$ (which is a parallel composition of $t_1$ and $t_2$) and $t_3$. The minimum response time
in the above assignment is 46 (assigning 2 processors to $t_1$, 3 to $t_2$ and one processor each
to $t_3$, $t_4$ and $t_5$) and the achieved throughput is 1/20.

Once the BDT is known, the cost of determining the optimal assignment is $O(np^2)$, as every
response-time function composition has cost $O(p^2)$; there are at most $n$ such compositions
performed. As we have seen before, the cost is reduced to $O(L^2 n \log n)$ when the irreducible
minima $z_i$ for each $s_i$ satisfy $z_i \le L$. It is shown in [42] that a BDT can be constructed
in time proportional to the number of edges, which is $O(n^2)$. Since we assume $n < p$, the
$O(np^2)$ analysis cost dominates the procedure.
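
A minimal sketch of the bottom-up composition over a BDT is given below. It is an illustration under stated assumptions, not the paper's implementation: the tuple encoding `(kind, left, right)`, the function names and the leaf data are hypothetical, throughput-infeasible entries are assumed to be pre-marked as `math.inf` in the leaf functions, and both kinds of composition use the simple $O(p^2)$ enumeration of splits (the heap-based method of Section 4.2 is not used here).

```python
import math

def compose(f, g, op):
    """Response time function of two aggregated subtasks sharing x processors.

    f[k], g[k]: response times on k processors (index 0 holds math.inf).
    op is a sum for a series composition and max for a parallel one.
    """
    p = len(f) - 1
    h = [math.inf] * (p + 1)
    for x in range(2, p + 1):
        for i in range(1, x):          # i processors to f's task, x - i to g's
            h[x] = min(h[x], op(f[i], g[x - i]))
    return h

def evaluate(node, leaf):
    """Evaluate a BDT given as a leaf name or a tuple (kind, left, right)."""
    if isinstance(node, str):
        return leaf[node]
    kind, left, right = node
    op = (lambda a, b: a + b) if kind == "S" else max
    return compose(evaluate(left, leaf), evaluate(right, leaf), op)

# Hypothetical leaf response time functions for p = 6 processors.
inf = math.inf
leaf = {"t1": [inf, 20, 12, 8, 8, 8, 8],
        "t2": [inf, 15, 10, 9, 9, 9, 9],
        "t3": [inf, 6, 4, 3, 3, 3, 3]}
h = evaluate(("S", ("P", "t1", "t2"), "t3"), leaf)
print(h[3:])  # [26, 21, 18, 16]
```

Each internal BDT node triggers exactly one call to `compose`, so a tree with at most $n$ internal nodes costs $O(np^2)$ in this sketch.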

6 The Throughput Problem 

In computations where the input data rates must be maximized to handle real time constraints, the
objective of the system is to achieve a high throughput. Typically, there is a limit on the amount
of time the system can take to process a single data set, i.e., the response time. Under these
conditions the objective of an assignment becomes maximization of the throughput subject to a
specified response time requirement. We have referred to this problem as the throughput problem.
In this section we show how solutions to the response-time problem can be used to solve the
throughput problem. If one can solve the response-time problem for a given pipeline computation
in $O(C(n,p))$ time, then one can solve its throughput problem in
$O(np \log(pn) + \log(np) \, C(n,p))$ time.

Our approach depends on the fact that minimal response times behave monotonically with 
respect to the throughput constraint. 

Lemma 6.1 For any pipeline computation $\mathcal{P} = \langle K, T, F, G \rangle$, let
$\rho(\lambda)$ be the minimal possible response time of $\mathcal{P}$, given throughput constraint
$\lambda$. Then $\rho(\lambda)$ is a monotone non-decreasing function of $\lambda$.

Proof: Recall that $u_\lambda(t_i)$ is the minimum number of processors required for task $t_i$ to
meet throughput constraint $\lambda$. For every $t_i$, $u_\lambda(t_i)$ is clearly a monotone
non-decreasing function of $\lambda$. Call an assignment $\lambda$-feasible if, for all
$i = 1, \ldots, n$, it assigns at least $u_\lambda(t_i)$ processors to $t_i$. Finally, let
$A_\lambda$ be the set of all $\lambda$-feasible assignments. Whenever $\lambda_1 < \lambda_2$, we
must have $A_{\lambda_2} \subseteq A_{\lambda_1}$ because of the monotonicity of each
$u_\lambda(t_i)$. Since $\rho(\lambda)$ is the minimum cost among all assignments in $A_\lambda$,
we have $\rho(\lambda_2) \ge \rho(\lambda_1)$.

This result can be viewed as a generalization of Bokhari's graph-based argument for monotonicity
of the minimal "sum" cost, given a "bottleneck" cost [5].

Suppose for a given pipeline computation we are able to solve for $\rho(\lambda)$, given any
$\lambda$. The set of all possible throughput values is
$\{ 1/f_i(k) \mid i = 1, \ldots, n;\ k = 1, \ldots, p \}$; $O(pn \log(pn))$ time is needed to sort
them. Now suppose a response time constraint $\bar{\rho}$ is given. For any given throughput
$\lambda$ we may compute $\rho(\lambda)$, and determine whether $\rho(\lambda) \le \bar{\rho}$.
$\rho(\lambda)$ is monotone in $\lambda$, which permits us to perform a binary search over the
sorted space of throughputs and identify the greatest one, say $\lambda^*$, for which
$\rho(\lambda^*) \le \bar{\rho}$. The assignment associated with $\rho(\lambda^*)$ is the one
maximizing throughput using $p$ processors, subject to response time constraint $\bar{\rho}$. If
the cost of solving one response-time problem is $O(C(n,p))$, then the cost of solving the
throughput problem is $O(pn \log(pn) + C(n,p) \log(pn))$.

Lemma 6.2 Let $\mathcal{P}$ be a pipeline computation, and suppose that the complexity of solving
the response-time problem for $\mathcal{P}$ is $O(C(n,p))$. Then the complexity of solving the
throughput problem for $\mathcal{P}$ is $O(pn \log(pn) + C(n,p) \log(pn))$.
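
The binary search can be sketched as follows for the special case of a chain of tasks. The names and the sample data are hypothetical, and the code makes two simplifying assumptions: each response time function is non-increasing, and the search is carried out over per-task response time limits $1/\lambda$ rather than over $\lambda$ itself, which avoids floating-point round trips but explores the same candidate set in the same order.

```python
import math

def series_min_response(g, p, limit):
    """Minimal response time of a chain on p processors when every task must
    have response time <= limit (that is, throughput >= 1/limit)."""
    prev = [0.0] * (p + 1)             # empty chain: zero response time
    for gj in g:
        cur = [math.inf] * (p + 1)
        for x in range(1, p + 1):
            for i in range(1, x + 1):  # i processors to this task
                if gj[i] <= limit:
                    cur[x] = min(cur[x], gj[i] + prev[x - i])
        prev = cur
    return prev[p]

def max_throughput(g, p, rho_bar):
    """Largest achievable throughput whose minimal response time is <= rho_bar,
    or None if no candidate throughput satisfies the response time bound."""
    # Candidate limits sorted from loosest (lowest throughput) to tightest.
    limits = sorted({gj[i] for gj in g for i in range(1, p + 1)}, reverse=True)
    lo, hi, best = 0, len(limits) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if series_min_response(g, p, limits[mid]) <= rho_bar:
            best = limits[mid]
            lo = mid + 1               # feasible: try a higher throughput
        else:
            hi = mid - 1               # infeasible: relax the throughput
    return None if best is None else 1.0 / best

# Hypothetical two-task chain on p = 5 processors.
g = [[None, 10, 6, 4, 3, 3],
     [None, 9, 5, 4, 4, 4]]
print(max_throughput(g, 5, 9))   # 0.2
print(max_throughput(g, 5, 8))   # None
```

The search is valid only because the minimal response time is monotone in the throughput constraint, which is exactly the content of Lemma 6.1.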

When solving the response time problem, we typically compute an entire response time function,
which essentially gives the "answer" (minimal response time) for a whole range of processors. When
we solve the throughput problem in the manner just described, we compute a single answer, for a
single processor count. If we desire a range of throughputs for a range of processors, we need to
repeat the procedure above once for every processor count.
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Figure 4: Computation Flow for Motion Estimation 

The complexity of the algorithms for the throughput problem is seen to be higher, by a
logarithmic factor, than that of the algorithms for the response time problem. For example, the
complexity for serial task structures is seen to be $O(np^2 \log(np)) = O(np^2 \log p)$. Future
work includes the pursuit of more efficient algorithms for the throughput problem.

7 An Application 

In this section we illustrate our methods by considering an application requiring pipelined
execution: a motion estimation system in computer vision. Motion estimation is an important problem
in computer vision in which the goal is to characterize the motion of moving objects in a scene.
From a computational point of view, continually generated images from a camera must be processed by
a number of tasks. In order to process the images (data sets), throughput and response time
constraints are imposed on the tasks and therefore, the appropriate model of computation is a
pipeline computation. The application itself is described in detail in [11, 28]. It should be noted
that there are many approaches to solving the motion estimation problem. We are only interested
in an example, and therefore, the following algorithm is not presented as the only or the best way
to perform motion estimation. A comprehensive digest of papers on the topic of motion understanding
can be found in [31]. The following subsection briefly describes the underlying computations.

7.1 A Motion Estimation System 

Figure 4 shows the task structure of our motion estimation system [11] - a linear task structure. 
The data sets input to the task system are a continuous stream of stereo image pairs of a scene 
containing the moving vehicles. The required output is a list of 3-dimensional points (or features) 
that describe the motion at each time step. 

The system consists of nine major tasks: 

1. Task $t_1$. The first task performs 2-D convolution on the input image pair. The convolution
window size is an image-size independent input parameter.

2. Task $t_2$. The second task extracts the zero crossings of the convolved image using a
thresholding algorithm. Zero crossings represent edge features in the image.

3. Task $t_3$. The third task fits patterns to the edge features by using a template matching
algorithm. There are 24 possible patterns that can be fit to an edge [21].

4. Task $t_4$. The fourth task performs a stereo match algorithm to match features from the left
and right images of the same time frame [28]. To find a match for a feature in the left image
from the right image, a weighted sum of the correlation coefficient and the directional difference
weight between the feature in the left image and all the features in the search space of
the right image is calculated. The feature in the right image that has the maximum total
weight is considered as the matched feature. Details are provided in [28, 11].

5. Tasks $t_5$, $t_6$ and $t_7$. These are similar to $t_1$, $t_2$ and $t_3$ respectively except
that the algorithms are applied to stereo images separated in time by wider margins, depending on
the desired accuracy for estimation.

6. Task $t_8$. This task performs a time match algorithm between matched features of the left
image obtained from $t_4$ and features of the left image obtained from $t_7$. The time match
process is similar to the stereo match process except for the fact that the first stereo match
guides the time match process and the search space for the time match algorithm is much larger.

7. Task $t_9$. Finally, the ninth task performs a second stereo match between the left and right
images of the stereo images from later time frames. The output of $t_9$ is a set of 3-D feature
points that describe the motion of an object between the two time frames.

All nine tasks are repeated for image inputs obtained continuously. In order to represent real-time 
motion estimation at video frame rates the entire process must be completed in 0.0333 seconds. 
The Image Understanding Benchmark [43] has a similar structure of computation flow - several 
tasks must be performed in a sequence in order to recognize an object in the scene and find the 
model that best describes the object. 

7.2 Shared and Distributed Multiprocessors 

All nine tasks were implemented on a distributed memory machine, the Intel iPSC/2 [7], and
a shared memory machine, the Encore Multimax [15]. The Intel iPSC/2 is a circuit-switched
hypercube multiprocessor. We used a 32 node iPSC/2 machine. Each node consists of an Intel
80386 processor and a floating point co-processor together with 4 Mbytes of RAM and a 64
Kbyte cache. The Encore Multimax 520 is a bus based system installed with eight dual processor
cards. Each dual processor card incorporates two NS32532 processors, each with its own 256 Kbyte
cache of fast static RAM. It has 128 Mbytes of shared memory.


7.3 Implementation Results for Individual Tasks 

We implemented the task system described above using outdoor images [11]. Several methods for 
implementing each algorithm (e.g., block partitioning, dynamic partitioning [11]) were used; for 
each task, we have selected the best performance numbers from these alternatives. The completion 
times for each algorithm were tabulated and are shown in Tables 3 and 4. Note that for each 
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multiprocessor size, the completion times include all the overheads, computation time and com- 
munication time. Therefore, when selecting a partition of processors for a task, the corresponding 
response time will include all the overheads, computation time and communication times (including 
transferring data from one task to the next). The times in the table are only shown for selected 
multiprocessor sizes, although individual tasks can be executed on an arbitrary number of proces- 
sors. Since the sizes of the machines available to us were limited, for the purposes of illustration, 
we extrapolated the completion times for larger machines as shown in the tables. Extrapolation 
was done using the immediate speedup available from the largest multiprocessor. For example, we 
computed the speedup (percentage improvement in response times) going from 16 to 32 processors 
for Intel iPSC/2 and then reduced this number by five percent (the degradation in speedup in the 
range 8 to 32); the resulting number was taken as the speedup going from 32 to 64 processors. The 
portion of each response time table with times for 64, 128 and 256 processors was estimated in this 
manner. It should be noted that the absolute values of completion times have no impact on the
execution of the assignment algorithms proposed. If individual completion times are different, the
allocation may be different. The response time functions in both tables are found to be decreasing
and convex.

A basic premise of our assignment algorithms is that we can measure response time functions 
of elemental tasks, then accurately compute the response time functions of aggregate tasks. The 
premise was validated on this application — the measured response time function for the entire 
system was found to deviate from the predicted response time function by no more than 5% at any 
processor count. This accuracy is largely due to the fact that the application is compute-bound; the 
computation-to-communication ratio is 100 to 1. Any errors introduced by our simplistic approach 
to communication costs are bound to be low. The accuracy is also due in part to the fact that all 
possible mappings of the pipeline were constructed to avoid shared communication channels — one 
can always embed a chain in a hypercube. Thus, no effects due to channel contention exist in 
the measurements. It remains to be seen how well our approach predicts response time functions on
less compute-intensive applications. Nevertheless, applications of the type we consider here are 
practical, and important. 

7.4 Experimental Results 
7.4.1 The Response Time Problem 

The algorithm for serial tasks with convex response time functions (in Section 4) was run using 
Tables 3 and 4 for a range of desired throughput constraints. As an example of the output generated 
by the algorithm, Table 5 shows the processor assignment for individual tasks for various sizes of 
the Intel iPSC/2. The last row of the table also shows the minimum response time for the given 
throughput constraint (λ = 0.05 tasks/second). We observe that some throughput conditions 
cannot be met by all sizes of multiprocessors. For example, a throughput of 0.125 tasks/second 
cannot be achieved on a 32 or 64 processor machine, but it can be achieved on a 128 or 256 
processor machine, for which the minimum response times were observed to be 22.18 and 12.98 
seconds respectively. Furthermore, the achieved throughput for a 128 processor machine was 0.157 

Table 3: Completion times for individual tasks on the Intel iPSC/2 of various sizes (* indicates 
extrapolated values) 

Response Times for Individual Tasks (Sec.) 

No. of
Proc.    Task 1   Task 2   Task 3   Task 4   Task 5   Task 6   Task 7   Task 8   Task 9
1         109.0     6.15     0.32    24.67    109.0     6.15     0.32   129.02    18.20
2         54.76     3.07     0.16    12.52    54.76     3.07     0.16    67.70     9.15
4         27.51     1.58    0.081     6.32    27.51     1.58    0.081    34.22     4.58
8         13.88     0.81    0.042     3.22    13.88     0.81    0.042    17.50     2.39
16         7.07     0.40    0.022     1.76     7.07     0.40    0.042    10.30     1.52
32         3.78     0.20    0.012     1.01     3.78     0.20    0.012     6.36     1.01
64*        2.12     0.11    0.007     0.61     2.12     0.11    0.007     4.13     0.71
128*       1.25     0.06    0.004     0.38     1.25     0.06    0.004     2.81     0.52
256*       0.77     0.04    0.002     0.26     0.77     0.77     0.04    0.002     0.40

Table 4: Completion times for individual tasks on the Encore Multimax of various sizes (* indicates 
extrapolated values) 

Response Times for Individual Tasks (Sec.) 

No. of
Proc.    Task 1   Task 2   Task 3   Task 4   Task 5   Task 6   Task 7   Task 8   Task 9
1        352.20    16.54     0.85    51.70   352.20    16.54     0.85   212.00    25.50
2        176.08     8.33     0.69    28.00   176.08     8.33     0.69   103.77    13.10
4         88.38     4.26     0.60    15.10    88.38     4.26     0.60    51.70     7.10
8         45.42     2.14     0.32     8.70    45.42     2.14     0.32    25.98     4.25
16        26.99     1.23     0.20     5.00    26.99     1.23     0.20    15.23     2.76
32*       16.84     0.74     0.13     3.01    16.84     0.74     0.13     9.37     1.88
64*       11.03     0.47     0.09     1.91    11.03     0.47     0.09     6.06     1.34
128*       7.59     0.31     0.06     1.27     7.59     0.31     0.06     4.11     1.01
256*       5.48     0.22     0.05     0.89     5.48     0.22     0.05     2.93     0.80

Table 5: An example processor allocation for minimizing response time for several sizes of iPSC/2 
(MRT = Minimum Response Time, Specified Throughput = 0.05 tasks/sec., No. of processors 
allocated to individual tasks are shown) 

                     Multiprocessor Size (No. of Procs.)
              32                64                128               256
Task    Proc.    Time     Proc.    Time     Proc.    Time     Proc.    Time
No.     Asgn.   (Sec.)    Asgn.   (Sec.)    Asgn.   (Sec.)    Asgn.   (Sec.)
1          8    13.88       16     7.07       32     3.78       64     2.12
2          1     6.15        2     3.07        8     0.81       16     0.40
3          1     0.32        1     0.32        1     0.32        2     0.16
4          2    12.52        6     4.77        8     3.22       16     1.76
5          8    13.88       16     7.07       32     3.78       64     2.12
6          1     6.15        2     3.07        6     1.19       12     0.60
7          1     0.32        1     0.32        1     0.32        2     0.16
8          8    17.50       16    10.30       32     6.36       64     4.13
9          2     9.15        4     4.58        8     2.39       16     1.52
MRT             79.87             40.57             22.18             12.98

tasks/second, and for a 256 processor machine the achieved throughput was 0.242 tasks/second. 
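
The figures quoted above can be checked directly from Table 5. The following Python sketch assumes the usual chain model for a pipeline: the response time is the sum of the stage times, and the throughput is the reciprocal of the slowest stage time. The dictionary `table5` and the two helper names are introduced here for illustration only. Under this assumption the Table 5 stage times reproduce the MRT row to within rounding, as well as the achieved throughputs of 0.157 and 0.242 tasks/second.

```python
# Stage times (seconds) for Tasks 1..9, copied from Table 5,
# keyed by machine size (number of processors).
table5 = {
    32:  [13.88, 6.15, 0.32, 12.52, 13.88, 6.15, 0.32, 17.50, 9.15],
    64:  [7.07, 3.07, 0.32, 4.77, 7.07, 3.07, 0.32, 10.30, 4.58],
    128: [3.78, 0.81, 0.32, 3.22, 3.78, 1.19, 0.32, 6.36, 2.39],
    256: [2.12, 0.40, 0.16, 1.76, 2.12, 0.60, 0.16, 4.13, 1.52],
}


def response_time(stage_times):
    """Response time of a chain: one data set passes through every stage."""
    return sum(stage_times)


def throughput(stage_times):
    """Throughput of a pipeline: limited by its slowest stage."""
    return 1.0 / max(stage_times)


for size, times in sorted(table5.items()):
    print(size, round(response_time(times), 2), round(throughput(times), 3))
```

Note that in every column the bottleneck is Task 8, which is why the achieved throughput exceeds the specified 0.05 tasks/second on all four machine sizes.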

Figure 5 shows the optimal response time function for the entire pipeline computation together 
with the achieved throughput using the hypercube data. As we might expect, the response time 
function is decreasing and the achieved throughput is increasing. Figure 6 shows response times for 
a specified throughput of λ = 0.05 tasks/second for different hypercube sizes. Along with the response 
time function from Figure 5, two curves are shown to provide a comparison with non-optimal, yet 
simple, heuristics for processor assignment. The first heuristic, called the equal allocation heuristic, 
allocates an equal number of processors to each task, thus ignoring the response time functions of the 
individual tasks (this takes O(n) time). The second heuristic, called the ratio heuristic, attempts 
to take these functions into account through the use of ratios: initially each task is assigned a 
processor; the remaining processors are distributed in proportion to the quantities f_i(1), 1 ≤ i ≤ n, 
for each of the n tasks (requiring O(n) time). Our optimal algorithm (O(n log p)) always achieves a 
lower response time than the two simple O(n) heuristics. Comparing the achieved throughputs in 
Figure 7, it can be observed that the ratio heuristic achieves higher throughput than the optimal 
algorithm because it does not trade off throughput to achieve the minimum response time, i.e., 
the heuristic is not guaranteed to satisfy the response-time constraint. The equal allocation strategy 
performs rather poorly, as one might expect. 
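
The two heuristics can be sketched in Python as follows. The text does not say how fractional shares are rounded or what happens to processors left over by the equal split, so this sketch makes two assumptions: the equal allocation heuristic leaves any remainder idle, and the ratio heuristic rounds by the largest-remainder rule. The function names are hypothetical, and the sort used for rounding makes this version O(n log n) rather than O(n).

```python
def equal_allocation(n, p):
    """Give each of the n tasks the same number of processors.

    Assumption: the p mod n leftover processors are simply left idle.
    """
    if p < n:
        raise ValueError("need at least one processor per task")
    return [p // n] * n


def ratio_allocation(single_proc_times, p):
    """One processor per task, then split the rest in proportion to f_i(1).

    single_proc_times: the values f_i(1) for each task.
    Assumption: fractional shares are rounded by largest remainder.
    """
    n = len(single_proc_times)
    if p < n:
        raise ValueError("need at least one processor per task")
    spare = p - n
    total = sum(single_proc_times)
    shares = [spare * t / total for t in single_proc_times]
    extra = [int(s) for s in shares]            # integer part of each share
    leftover = spare - sum(extra)               # processors still unassigned
    by_remainder = sorted(range(n), key=lambda i: shares[i] - extra[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        extra[i] += 1
    return [1 + e for e in extra]


# Single-processor times f_i(1) for Tasks 1..9 on the iPSC/2 (Table 3, row 1).
f1 = [109.0, 6.15, 0.32, 24.67, 109.0, 6.15, 0.32, 129.02, 18.20]
print(equal_allocation(len(f1), 32))
print(ratio_allocation(f1, 32))
```

Neither routine looks at the throughput or response time constraint, which is consistent with the observation that the heuristics may miss it.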

The tradeoff of response time versus throughput constraint (using optimal response time 
functions) is studied in Figures 8 and 9 for 128- and 256-processor hypercubes. Figure 8 shows the 
response time and Figure 9 shows the corresponding achieved throughput as a function of the 
specified throughput. As we can observe, the response time curve follows the throughput curve 


[Plot legend: response time; achieved throughput] 

Figure 5: Response Time Problem: Response Time and Achieved Throughput 


[Plot: comparison of response times for specified throughput = 0.05; curves: optimal algorithm, ratio heuristic, equal allocation] 

Figure 6: Response Time Problem: Comparison with heuristics 



[Plot: comparison of achieved throughputs for specified throughput = 0.05; curves: optimal algorithm, ratio heuristic, equal allocation] 

Figure 7: Response Time Problem: Achieved throughputs for heuristics 


[Plot: comparison of response times for 128 and 256 processor hypercubes; curves: p = 128, p = 256] 

Figure 8: Response Time Problem: Response time with increasing throughput constraint 

[Plot: comparison of achieved throughputs for 128 and 256 processor hypercubes; horizontal axis: specified throughput] 

Figure 9: Response Time Problem: Achieved throughput with increasing throughput constraint 


[Plot: response time and achieved throughput for specified throughput = 0.0125; curves: response time, achieved throughput] 

Figure 10: Response Time Problem: Results for Encore 

Figure 11: Throughput Problem: Throughputs and achieved response times 

in shape; this clearly indicates that the algorithm trades off response time to achieve the specified 
throughput. This is exemplified at high throughput constraints where the minimum response time 
increases significantly in order to achieve the specified throughput. For low values of specified 
throughput, the change in minimum response time is insignificant because the throughput can be 
achieved easily with the given number of processors. For a larger system the knee of the curves 
shifts to the right as expected due to the additional resources (as shown for a 256-processor system). 
Finally, Figure 10 plots the response time as a function of the number of processors for the Encore 
data. The graph is seen to closely resemble Figure 5. To avoid repetition, we do not show further 
results for the Encore. 

7.4.2 The Throughput Problem 

Figure 11 illustrates the maximum throughput obtained and the corresponding achieved response 
time for our task system when the specified response time is 100 seconds. The results generated 
by the two heuristics described earlier are presented in Figure 12. The optimal algorithm generates 
higher throughputs than those achieved by the two heuristics. Figure 13 shows the achieved response 
times when using the heuristics. The ratio heuristic achieves a lower response time than that achieved by 
the optimal algorithm because it does not necessarily satisfy the throughput constraint. 

The tradeoff between response time and throughput is shown once again, this time in the 
context of the throughput problem, in Figures 14 and 15 for 128 and 256 processor hypercubes as a 
function of the specified response time. The solid line shows the maximum possible throughput 
when there is no response time constraint. Therefore, for any specified response time, the 
difference between the maximum throughput and the unconstrained maximum throughput represents the 
amount of throughput traded off to achieve the specified response time. Furthermore, we can observe 
that as the specified response time increases, the difference between the unconstrained maximum 

[Plot: comparison of maximum throughput of different allocation algorithms; curves: optimal algorithm, ratio heuristic, equal allocation] 

Figure 12: Throughput Problem: Throughputs obtained by heuristics 


[Plot: comparison of achieved response times of different allocation algorithms] 

Figure 13: Throughput Problem: Achieved response times for heuristics 

[Plot: comparison of throughputs for 128 and 256 processor hypercubes] 

Figure 14: Throughput Problem: Maximum throughput with increasing response time constraint 

throughput and the achieved throughput shrinks because the response time constraint is weakened. 
Beyond a certain point, the response time constraint is so weak that the maximum unconstrained 
throughput is achieved, as shown by the plateau in the throughput curve. This phenomenon is also 
observed in functional pipelines in processor designs, where inserting delays in the pipeline stages 
results in higher throughput at the cost of response time [26, 34, 40]. 

8 Summary 

In this paper we have formulated the problem of optimizing the performance of a pipeline 
computation, represented by a task structure, on a parallel architecture, given a large supply of processors 
and the experimentally determined response time functions for its constituent tasks. Unlike prior 
treatments of the mapping problem, we considered the case where there are many more processors 
than tasks and where tasks are not queued or scheduled. We considered the dual problems of 
minimizing response time subject to a throughput constraint, and maximizing throughput subject to 
a response time constraint. As we observed in our sample application, these problems are 
complementary, in the sense that allocation to increase throughput may have the side effect of increasing 
response time, and vice versa. 

The problem posed in this paper was shown to be solvable in polynomial time for a useful class 
of task structures. Specifically, we presented O(np^2) algorithms (where n is the number of tasks 
and p is the number of processors) for the response time problem, for the cases where the task 
structures are linear, tree-structured and series-parallel graphs. The algorithms designed for the 
response time problem can be used to solve the throughput problem with an additional logarithmic 
factor in complexity. To place the work in a realistic setting we considered an application, stereo 
image matching on two parallel architectures, and evaluated the performance of our assignment 

[Plot: comparison of achieved response times for 128 and 256 processor hypercubes] 

Figure 15: Throughput Problem: Achieved response times with increasing response time constraint 

algorithms. Future endeavors include the provision of algorithms for general task structures and 
investigation of faster and parallelized assignment algorithms. 
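
To make the complexity statements above concrete, the following Python sketch shows one way a chain (linear task structure) could be handled: a dynamic program over tasks and processor counts that runs in O(np^2) time for the response time problem, and a binary search over candidate bottleneck times that reuses it for the throughput problem. This is an illustrative sketch under the chain model (response time is the sum of stage times, throughput is the reciprocal of the slowest stage), not necessarily the algorithm developed in the paper. The function names and the example response time functions f_i(x) = w_i / x are hypothetical.

```python
import math


def min_response(fs, p, max_stage_time=math.inf):
    """Minimise the sum of stage times for a chain of tasks on p processors.

    fs: list of functions, fs[i](x) = time of task i on x processors.
    max_stage_time: every stage must take at most this long
                    (a throughput constraint L corresponds to 1 / L).
    Returns (response_time, allocation) or None if infeasible.
    """
    INF = math.inf
    # best[j]: minimum total time of the tasks handled so far
    # when they use exactly j processors in total.
    best = [0.0] + [INF] * p
    choices = []
    for f in fs:
        new = [INF] * (p + 1)
        arg = [0] * (p + 1)
        for j in range(1, p + 1):
            for x in range(1, j + 1):          # x processors for this task
                t = f(x)
                if t <= max_stage_time and best[j - x] + t < new[j]:
                    new[j] = best[j - x] + t
                    arg[j] = x
        best = new
        choices.append(arg)
    j = min(range(p + 1), key=lambda k: best[k])
    if best[j] == INF:
        return None
    total = best[j]
    alloc = []
    for arg in reversed(choices):              # walk back through the choices
        alloc.append(arg[j])
        j -= arg[j]
    alloc.reverse()
    return total, alloc


def max_throughput(fs, p, max_response):
    """Maximise throughput subject to response time <= max_response.

    Binary search over candidate bottleneck times; feasibility is monotone
    because a looser stage-time bound can only lower the minimum response.
    Returns (throughput, allocation) or None if infeasible.
    """
    candidates = sorted({f(x) for f in fs for x in range(1, p + 1)})
    lo, hi, answer = 0, len(candidates) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        result = min_response(fs, p, candidates[mid])
        if result is not None and result[0] <= max_response:
            answer = (1.0 / candidates[mid], result[1])
            hi = mid - 1                        # try a smaller bottleneck
        else:
            lo = mid + 1
    return answer


# Hypothetical example: two stages with ideal speedup, f_i(x) = w_i / x.
fs = [lambda x: 9.0 / x, lambda x: 1.0 / x]
print(min_response(fs, 4))
print(max_throughput(fs, 4, max_response=5.0))
```

The binary search calls the dynamic program a logarithmic number of times, which mirrors the extra logarithmic factor mentioned for the throughput problem.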